Time Series Analysis - POC

Preprocess raw data and create single time series

Single Time Series for each rail corridor and station

Load Single Time Series

I randomly selected Metrolinx_Lakeshore East_UNION STATION for the analysis.

I split the data into three datasets with the load_single_timeseries function.

-- The first is data_all: the original time series with the ridership numbers, which I will forecast into the future, and the number of trips.

-- The second is data_regressors, which includes the trip numbers. I will use this as the regressor in the Prophet model, since the number of trips might affect the total number of riders.

-- The third is data, which includes only the number of trips.

For Prophet, we need all three datasets. For the classical models, I will use just one of them.

This time series has three features: date, number of riders, and number of trips.

There are 1451 rows in total. However, based on the start and end dates, a daily dataset should have 1461 time stamps, so 10 days are missing.

Therefore, I implemented the following preprocessing on this time series:

1 - I filled the missing time stamps with 0, assuming there were no trips and/or no riders on those days.

2 - I also added one more day to the time series to make working with lists easier. The data in this additional row is never used.

3 - I changed the data types of the ridership and trips columns from object to integer.
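
The three preprocessing steps above can be sketched with pandas as follows. This is only an illustration under stated assumptions, not the notebook's actual code: the column names `date`, `riders`, and `trips` and the toy values are hypothetical. The integer cast is done first here so that the zero fill value matches the column type.

```python
import pandas as pd

# Toy stand-in for the raw series; column names and values are hypothetical.
raw = pd.DataFrame(
    {
        "date": ["2019-01-01", "2019-01-02", "2019-01-05"],
        "riders": ["120", "95", "130"],  # numbers stored as strings
        "trips": ["10", "8", "11"],
    }
)
raw["date"] = pd.to_datetime(raw["date"])
ts = raw.set_index("date")

# Step 3: cast the ridership and trips columns to integers.
ts = ts.astype({"riders": "int64", "trips": "int64"})

# Steps 1 and 2: build the full daily index, extended by one extra day,
# and fill every missing time stamp with 0.
full_index = pd.date_range(
    ts.index.min(), ts.index.max() + pd.Timedelta(days=1), freq="D"
)
ts = ts.reindex(full_index, fill_value=0)

print(len(ts))    # number of daily time stamps after reindexing
print(ts.dtypes)
```

Comparing `len(ts)` with the raw row count is a quick way to see how many days were missing from the original data.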

Exploratory Time Series Analysis

Visualization

In EDA, I first visualized the time series to get an overall understanding of it.

Trend and Seasonality

In time series analysis, I checked two important features: Trend and seasonality.

The time series should be stationary, which means it must have no trend or seasonality; in other words, its statistical properties (such as mean and variance) must not be a function of time. If a time series is stationary, it has no predictable pattern in the long term.

I use the Augmented Dickey-Fuller (ADF) test to check for a unit root, that is, a stochastic trend.

If the p-value of this test is greater than 0.05, the null hypothesis of a unit root cannot be rejected, and the time series is treated as non-stationary.

Decomposition

In exploratory data analysis, I also decompose the time series. The decomposition shows the trend, seasonality, and residuals.

There are two decomposition methods: Additive and multiplicative.

Additive means the time series is the sum of the base level, trend, seasonality, and residuals, while multiplicative means it is their product.

I use only the additive model here since this time series has 0 values. The multiplicative model does not accept zero, negative, or missing values.

ACF and PACF plots also help to observe seasonality.
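
As a rough illustration of why seasonality shows up in an ACF plot, the sketch below computes the sample autocorrelation by hand with NumPy. The function `sample_acf` and the weekly toy series are hypothetical; a plotting library would normally draw these values as bars.

```python
import numpy as np


def sample_acf(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag (biased estimator)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.sum(centered ** 2)
    return np.array(
        [
            np.sum(centered[lag:] * centered[: len(x) - lag]) / denom
            for lag in range(max_lag + 1)
        ]
    )


# Hypothetical series with a weekly pattern: five busy days, two quiet days.
series = np.tile([5.0, 5.0, 5.0, 5.0, 5.0, 1.0, 1.0], 20)
acf = sample_acf(series, max_lag=14)
print(np.round(acf, 2))
```

With a weekly pattern, the autocorrelation peaks at lags 7 and 14 and dips in between, which is the signature to look for in the plot.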

To make the time series stationary, I use the differencing method.

Now, the time series has no unit root and it is stationary.
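
Differencing can be sketched with pandas as below. The toy series is hypothetical, and whether the notebook uses first differencing, seasonal differencing, or both is not stated, so both are shown.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2019-01-01", periods=28, freq="D")
# Hypothetical series: linear trend (2 per day) plus a repeating weekly pattern.
series = pd.Series(
    2.0 * np.arange(28) + np.tile([0, 3, 1, 4, 2, -5, -5], 4), index=index
)

# First difference: y[t] - y[t-1], removes the linear trend.
first_diff = series.diff().dropna()

# Seasonal difference: y[t] - y[t-7], removes the weekly pattern.
# Every value is 14.0 here: the pattern cancels and the trend adds 2 * 7.
seasonal_diff = series.diff(7).dropna()

print(first_diff.head())
print(seasonal_diff.unique())
```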

Train-Test Split

I take the first 3 years as training data and the last year for testing.
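
A chronological split like this can be sketched as follows. The start and end dates are hypothetical, chosen only so that the daily index has 1461 time stamps; the values are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical four-year daily index; the real dates are not stated here.
index = pd.date_range("2016-01-01", "2019-12-31", freq="D")
series = pd.Series(np.arange(len(index)), index=index)

# Split by date, never by random sampling, so the test set lies in the future.
split_date = index.min() + pd.DateOffset(years=3)  # first day of the test year
train = series[series.index < split_date]
test = series[series.index >= split_date]

print(len(train), len(test))
```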

Model 1 - ARIMA

I use auto_arima to identify the parameters.

As the evaluation metrics and the plot above suggest, the prediction results are not good. The R-squared value is lower than expected. We can therefore say that the ARIMA model is not useful for this time series.
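
For reference, R-squared and two other common error metrics can be computed as in the sketch below. Only R-squared is named in these notes; MAE and RMSE, the function name `regression_metrics`, and the toy numbers are assumptions for the example.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R-squared for a forecast."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    ss_res = np.sum(errors ** 2)                     # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {
        "mae": np.mean(np.abs(errors)),
        "rmse": np.sqrt(np.mean(errors ** 2)),
        "r2": 1.0 - ss_res / ss_tot,
    }


y_true = [100, 120, 80, 110]
mean_forecast = [102.5] * 4          # always predicting the mean gives R^2 = 0
poor_forecast = [80, 110, 100, 120]  # worse than the mean gives R^2 < 0

print(regression_metrics(y_true, mean_forecast))
print(regression_metrics(y_true, poor_forecast))
```

An R-squared near or below zero means the forecast is no better than predicting the mean of the test period.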

Model 2 - LSTM

LSTM requires a specific format of scaled data. I use MinMax scaler here.

This scaler transforms the data as follows, where min and max are the bounds of the target feature range (0 and 1 by default):

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

X_scaled = X_std * (max - min) + min
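
The formula above, and the reshaping an LSTM typically needs, can be sketched in NumPy as below. The helper names `min_max_scale` and `make_windows`, the window length, and the (samples, time steps, features) input shape are assumptions for the example, not the notebook's actual code.

```python
import numpy as np


def min_max_scale(X, feature_range=(0.0, 1.0)):
    """Apply the min-max formula above to each column of X."""
    lo, hi = feature_range
    X = np.asarray(X, dtype=float)
    X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    return X_std * (hi - lo) + lo


def make_windows(values, n_steps):
    """Turn a 1-D series into (samples, n_steps, 1) inputs and next-step targets."""
    X = np.array([values[i : i + n_steps] for i in range(len(values) - n_steps)])
    y = values[n_steps:]
    return X[..., np.newaxis], y


riders = np.array([0.0, 50.0, 100.0, 150.0, 200.0])  # hypothetical values
scaled = min_max_scale(riders)
X, y = make_windows(scaled, n_steps=2)

print(scaled)
print(X.shape, y.shape)
```

Each input sample holds `n_steps` consecutive scaled values, and the target is the value that follows them.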

Model 3 - Prophet